knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
tidy=FALSE, # display code as typed
size="small") # slightly smaller font for code
options(digits = 3)
# default figure size
knitr::opts_chunk$set(
fig.width=6.75,
fig.height=6.75,
fig.align = "center"
)

Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries.
We have one column of character data, which contains country names, and four columns of numerical data, which contain the amount of each type of alcohol consumed in these countries. There are no missing values.
## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, …
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 19…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8,…
| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
desc_beer <- drinks %>%
  arrange(desc(beer_servings))
top25_beer <- head(desc_beer, 25)

ggplot(top25_beer, aes(x = reorder(country, -beer_servings), y = beer_servings)) +
  geom_col() +
  labs(x = "Country", y = "Beer Servings", title = "Top 25 Beer Consuming Countries") +
  theme(axis.text.x = element_text(angle = 90)) +
  NULL

desc_wine <- drinks %>%
  arrange(desc(wine_servings))
top25_wine <- head(desc_wine, 25)

ggplot(top25_wine, aes(x = reorder(country, -wine_servings), y = wine_servings)) +
  geom_col() +
  labs(x = "Country", y = "Wine Servings", title = "Top 25 Wine Consuming Countries") +
  theme(axis.text.x = element_text(angle = 90)) +
  NULL

desc_spirit <- drinks %>%
  arrange(desc(spirit_servings))
top25_spirit <- head(desc_spirit, 25)

ggplot(top25_spirit, aes(x = reorder(country, -spirit_servings), y = spirit_servings)) +
  geom_col() +
  labs(x = "Country", y = "Spirit Servings", title = "Top 25 Spirit Consuming Countries") +
  theme(axis.text.x = element_text(angle = 90)) +
  NULL

In the first plot we can see that Namibia has the highest yearly consumption of beer per capita. Beer is one of Namibia's most iconic commodities, and its top-quality brewing is a legacy of German colonialism. The second most beer-loving country is the Czech Republic, known for its unique brewing methods and prevalent drinking culture. The second plot, representing wine consumption, shows that the French and the Portuguese drink the most wine. This is not very surprising, given that these are among the best-known wine-producing countries in the world. Last but not least, the third plot shows that Grenada and Belarus are the most spirit-loving countries. We can safely assume that Grenadians prefer rum, while Belarusians enjoy their vodka.
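The three chunks above repeat the same arrange/head/ggplot pattern. A small helper could express it once; this is a sketch, and the function name `plot_top_n` is hypothetical (not part of the original analysis):

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar plot of the top n countries by a servings column.
# The column is passed unquoted and captured with {{ }} (tidy evaluation).
plot_top_n <- function(data, col, n = 25, label = "Servings") {
  data %>%
    slice_max({{ col }}, n = n) %>%
    ggplot(aes(x = reorder(country, -{{ col }}), y = {{ col }})) +
    geom_col() +
    labs(x = "Country", y = label,
         title = paste("Top", n, "Countries by", label)) +
    theme(axis.text.x = element_text(angle = 90))
}

# e.g. plot_top_n(drinks, beer_servings, label = "Beer Servings")
```

Each of the three plots above would then be a one-line call, differing only in the column and label.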
We will look at a subset of movies taken from the Kaggle IMDB 5000 movie dataset.
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…
| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
- gross: the gross earnings in the US box office, not adjusted for inflation
- budget: the movie's budget
- cast_facebook_likes: the number of Facebook likes cast members received
- votes: the number of people who voted for (or rated) the movie on IMDB
- reviews: the number of reviews for that movie
- rating: IMDB average rating

We count the number of movies by genre, then look at the average gross earnings and budget (gross and budget) by genre. We calculate a variable return_on_budget, which shows how many dollars a movie made at the box office for each dollar of its budget, and rank genres by return_on_budget in descending order.

movies <- read_csv(here::here("data", "movies.csv"))
movies %>%
group_by(genre) %>%
  count(sort = TRUE)

## # A tibble: 17 x 2
## # Groups: genre [17]
## genre n
## <chr> <int>
## 1 Comedy 848
## 2 Action 738
## 3 Drama 498
## 4 Adventure 288
## 5 Crime 202
## 6 Biography 135
## 7 Horror 131
## 8 Animation 35
## 9 Fantasy 28
## 10 Documentary 25
## 11 Mystery 16
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
## # A tibble: 17 x 3
## genre `mean(gross)` `mean(budget)`
## <chr> <dbl> <dbl>
## 1 Action 86583860. 71354888.
## 2 Adventure 95794257. 66290069.
## 3 Animation 98433792. 61701429.
## 4 Biography 45201805. 28543696.
## 5 Comedy 42630552. 24446319.
## 6 Crime 37502397. 26596169.
## 7 Documentary 17353973. 5887852.
## 8 Drama 37465371. 26242933.
## 9 Family 149160478. 14833333.
## 10 Fantasy 42408841. 17582143.
## 11 Horror 37713738. 13504916.
## 12 Musical 92084000 3189500
## 13 Mystery 67533021. 39218750
## 14 Romance 31264848. 25107500
## 15 Sci-Fi 29788371. 27607143.
## 16 Thriller 2468 300000
## 17 Western 20821884 3465000
movies_by_ratio <- movies %>%
  group_by(genre) %>%
  summarise(mean_gross = mean(gross),
            mean_budget = mean(budget),
            return_on_budget = mean_gross / mean_budget) %>%
  arrange(desc(return_on_budget))
movies_by_ratio

## # A tibble: 17 x 4
## genre mean_gross mean_budget return_on_budget
## <chr> <dbl> <dbl> <dbl>
## 1 Musical 92084000 3189500 28.9
## 2 Family 149160478. 14833333. 10.1
## 3 Western 20821884 3465000 6.01
## 4 Documentary 17353973. 5887852. 2.95
## 5 Horror 37713738. 13504916. 2.79
## 6 Fantasy 42408841. 17582143. 2.41
## 7 Comedy 42630552. 24446319. 1.74
## 8 Mystery 67533021. 39218750 1.72
## 9 Animation 98433792. 61701429. 1.60
## 10 Biography 45201805. 28543696. 1.58
## 11 Adventure 95794257. 66290069. 1.45
## 12 Drama 37465371. 26242933. 1.43
## 13 Crime 37502397. 26596169. 1.41
## 14 Romance 31264848. 25107500 1.25
## 15 Action 86583860. 71354888. 1.21
## 16 Sci-Fi 29788371. 27607143. 1.08
## 17 Thriller 2468 300000 0.00823
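Note that return_on_budget here is the ratio of the genre means, mean(gross) / mean(budget). Averaging the per-movie ratio, mean(gross / budget), can give a very different answer when budgets vary widely within a genre. A quick sketch with made-up numbers illustrates the gap:

```r
library(dplyr)

# Toy data (made-up numbers) contrasting two ways of averaging return on budget
toy <- tibble(gross  = c(100, 10),
              budget = c(50,  1))

ratio_of_means <- mean(toy$gross) / mean(toy$budget)  # 55 / 25.5, about 2.16
mean_of_ratios <- mean(toy$gross / toy$budget)        # mean(c(2, 10)) = 6
```

The ratio of means weights big-budget movies more heavily, while the mean of ratios treats every movie equally; which is appropriate depends on the question being asked.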
movies_director <- movies %>%
group_by(director) %>%
summarise(total_gross = sum(gross, na.rm=TRUE), mean_gross=mean(gross, na.rm=TRUE), median_gross=median(gross,na.rm=TRUE),sd_gross=sd(gross, na.rm=TRUE),count = n()) %>%
arrange(desc(total_gross))
top15_director <- head(movies_director,15)
top15_director

## # A tibble: 15 x 6
## director total_gross mean_gross median_gross sd_gross count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051. 23
## 2 Michael Bay 2231242537 171634041. 138396624 127161579. 13
## 3 Tim Burton 2071275480 129454718. 76519172 108726924. 16
## 4 Sam Raimi 2014600898 201460090. 234903076 162126632. 10
## 5 James Cameron 1909725910 318287652. 175562880. 309171337. 6
## 6 Christopher Nolan 1813227576 226653447 196667606. 187224133. 8
## 7 George Lucas 1741418480 348283696 380262555 146193880. 5
## 8 Robert Zemeckis 1619309108 124562239. 100853835 91300279. 13
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408. 19
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020. 5
## 11 Ron Howard 1335988092 111332341 101587923 81933761. 12
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822. 7
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765. 4
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773. 11
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285. 14
movies_ratings <- movies %>%
group_by(genre) %>%
summarise(mean_rating=mean(rating, na.rm=TRUE), median_rating=median(rating, na.rm=TRUE), max_rating=max(rating, na.rm=TRUE), min_rating=min(rating, na.rm=TRUE), sd_rating=sd(rating, na.rm=TRUE))
movies_ratings

## # A tibble: 17 x 6
## genre mean_rating median_rating max_rating min_rating sd_rating
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Action 6.23 6.3 9 2.1 1.03
## 2 Adventure 6.51 6.6 8.6 2.3 1.09
## 3 Animation 6.65 6.9 8 4.5 0.968
## 4 Biography 7.11 7.2 8.9 4.5 0.760
## 5 Comedy 6.11 6.2 8.8 1.9 1.02
## 6 Crime 6.92 6.9 9.3 4.8 0.849
## 7 Documentary 6.66 7.4 8.5 1.6 1.77
## 8 Drama 6.73 6.8 8.8 2.1 0.917
## 9 Family 6.5 5.9 7.9 5.7 1.22
## 10 Fantasy 6.15 6.45 7.9 4.3 0.959
## 11 Horror 5.83 5.9 8.5 3.6 1.01
## 12 Musical 6.75 6.75 7.2 6.3 0.636
## 13 Mystery 6.86 6.9 8.5 4.6 0.882
## 14 Romance 6.65 6.65 7.1 6.2 0.636
## 15 Sci-Fi 6.66 6.4 8.2 5 1.09
## 16 Thriller 4.8 4.8 4.8 4.8 NA
## 17 Western 5.70 5.70 7.3 4.1 2.26
ggplot(data = movies, mapping = aes(x = rating)) +
  geom_histogram() +
  scale_y_log10() +
  facet_wrap(~ genre)

Use ggplot to answer the following:

Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of Facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X-axes?

ggplot(data = movies, mapping = aes(x = cast_facebook_likes, y = gross)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  NULL

Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.

ggplot(data = movies, mapping = aes(x = budget, y = gross)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  NULL

Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre, and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?

ggplot(data = movies, mapping = aes(x = rating, y = gross)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  NULL

You may find useful the material on finance data sources.
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.
## Rows: 508
## Columns: 6
## $ symbol <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", …
## $ name <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie…
## $ ipo_year <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "199…
## $ sector <chr> "Health Care", "Consumer Durables", "Health Care", "Hea…
## $ industry <chr> "Medical/Dental Instruments", "Electrical Products", "M…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasda…
| Name | nyse |
| Number of rows | 508 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| symbol | 0 | 1 | 1 | 4 | 0 | 508 | 0 |
| name | 0 | 1 | 5 | 48 | 0 | 505 | 0 |
| ipo_year | 0 | 1 | 3 | 4 | 0 | 33 | 0 |
| sector | 0 | 1 | 6 | 21 | 0 | 12 | 0 |
| industry | 0 | 1 | 5 | 62 | 0 | 103 | 0 |
| summary_quote | 0 | 1 | 31 | 34 | 0 | 508 | 0 |
Based on this dataset, create a table and a bar plot that show the number of companies per sector, in descending order.
companies_per_sector <- nyse %>%
group_by(sector)%>%
summarise(Number_of_Companies = n()) %>%
arrange(desc(Number_of_Companies))
ggplot(data = companies_per_sector,
       mapping = aes(x = Number_of_Companies, y = reorder(sector, Number_of_Companies))) +
  geom_col() +
  labs(title = "Number of companies per sector \n", y = "", x = "Number of Companies") +
  theme_economist() +
  scale_x_continuous(expand = c(0, 0))

companies_per_sector

## # A tibble: 12 x 2
## sector Number_of_Companies
## <chr> <int>
## 1 Finance 97
## 2 Consumer Services 79
## 3 Public Utilities 60
## 4 Capital Goods 45
## 5 Health Care 45
## 6 Energy 42
## 7 Technology 40
## 8 Basic Industries 39
## 9 Consumer Non-Durables 31
## 10 Miscellaneous 12
## 11 Transportation 10
## 12 Consumer Durables 8
Next, let's choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; you should, however, add SPY, which is the S&P 500 ETF (Exchange Traded Fund).
# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("AAPL","JPM","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2020-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame

## Rows: 19,448
## Columns: 8
## Groups: symbol [8]
## $ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAP…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07,…
## $ open <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, …
## $ high <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, …
## $ low <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, …
## $ close <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, …
## $ volume <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, …
## $ adjusted <dbl> 10.2, 10.2, 10.3, 10.3, 10.4, 10.6, 10.5, 10.6, 10.7, 10.8, …
Financial performance analysis depends on returns: if I buy a stock today for 100 and sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
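The 100 to 101.75 example can be checked directly, and it also shows the difference between the arithmetic returns and the log returns computed below:

```r
p0 <- 100      # buy price
p1 <- 101.75   # sell price

arithmetic_return <- p1 / p0 - 1    # 0.0175, i.e. 1.75%
log_return        <- log(p1 / p0)   # slightly smaller; close to arithmetic for small moves

# Log returns are convenient because they add across days:
# exp(sum of daily log returns) recovers the total price relative.
```

For small daily moves the two are nearly identical, which is why the choice of `type` matters more over longer horizons.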
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted,
               mutate_fun = periodReturn,
               period     = "daily",
               type       = "log",
               col_rename = "daily_returns")
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted,
               mutate_fun = periodReturn,
               period     = "monthly",
               type       = "arithmetic",
               col_rename = "monthly_returns")
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
  tq_transmute(select     = adjusted,
               mutate_fun = periodReturn,
               period     = "yearly",
               type       = "arithmetic",
               col_rename = "yearly_returns")

Create a table where you summarise monthly returns for each of the stocks and SPY: min, max, median, mean, SD.
monthly_R <- myStocks_returns_monthly %>%
  summarise(mean_return   = mean(monthly_returns, na.rm = TRUE),
            median_return = median(monthly_returns, na.rm = TRUE),
            max_return    = max(monthly_returns, na.rm = TRUE),
            min_return    = min(monthly_returns, na.rm = TRUE),
            sd_return     = sd(monthly_returns, na.rm = TRUE))
monthly_R

## # A tibble: 8 x 6
## symbol mean_return median_return max_return min_return sd_return
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.0249 0.0257 0.200 -0.181 0.0782
## 2 ANF -0.000830 -0.000426 0.507 -0.421 0.139
## 3 DIS 0.0142 0.0114 0.234 -0.179 0.0646
## 4 DPZ 0.0324 0.0253 0.342 -0.188 0.0746
## 5 JPM 0.0124 0.0210 0.172 -0.229 0.0719
## 6 SPY 0.0112 0.0146 0.127 -0.125 0.0381
## 7 TSLA 0.0513 0.0148 0.811 -0.224 0.171
## 8 XOM -0.000410 0.000105 0.224 -0.262 0.0591
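To compare these monthly figures with annual numbers quoted elsewhere, one rough convention (assuming independent, identically distributed monthly returns, which is a simplification) is to scale the mean by 12 and the SD by sqrt(12). Using the SPY row from the table above:

```r
# SPY monthly figures from the table above
mean_monthly <- 0.0112
sd_monthly   <- 0.0381

annual_mean <- mean_monthly * 12       # roughly 13.4% a year
annual_sd   <- sd_monthly * sqrt(12)   # roughly 13.2% a year
```

The sqrt(12) scaling for volatility follows from variances (not SDs) adding across independent periods.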
Plot a density plot, using geom_density(), for each of the stocks.
ggplot(data = myStocks_returns_monthly, mapping = aes(x = monthly_returns)) +
  labs(title = "Monthly returns \n", y = "", x = "") +
  geom_density() +
  theme_economist() +
  facet_wrap(~ symbol)

ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
geom_density(fill = "darkgreen", alpha = 1) +
labs(title = "Monthly returns of portfolio since 01 January 2017", x = "Monthly returns", y = "Density") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
facet_wrap(~symbol, scales='fixed') +
  theme_clean()

What can you infer from this plot? Which stock is the riskiest? The least risky?
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) on the X axis. Please use ggrepel::geom_text_repel() to label each stock.
library(ggrepel)
g_risk_return <- ggplot(data = monthly_R,
                        mapping = aes(y = mean_return, x = sd_return, label = symbol)) +
  geom_point(color = "red")
g_2 <- g_risk_return +
  geom_text_repel() +
  labs(title = "Return vs. Risk", x = "risk", y = "return")
g_2

ggplot(monthly_R, aes(x = sd_return, y = mean_return, colour = symbol)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = symbol)) +
labs(title = "Expected Monthly Return of Portfolio Stocks",
y = "Expected return (Mean)", x = "Risk (SD)") +
theme_clean() +
  theme(legend.position = 'none')

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
For this task, you will analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees' income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
First let us load the data
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)

## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, …
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", …
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Trave…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358…
## $ Department <chr> "Sales", "Research & Development", "Research…
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26,…
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3,…
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3,…
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", …
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, …
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2,…
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1,…
## $ JobRole <chr> "Sales Executive", "Research Scientist", "La…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3,…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "M…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 26…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 996…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5,…
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes"…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, …
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3,…
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2,…
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0,…
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, …
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4,…
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3,…
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2,…
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0,…
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3,…
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g. education, are given as a number rather than a more useful description.
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)

## # A tibble: 2 x 2
## `attrition == "Yes"` n
## <lgl> <int>
## 1 FALSE 1233
## 2 TRUE 237
distribution_keyvalues <- hr_cleaned %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            max_age = max(age, na.rm = TRUE),
            min_age = min(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE),
            mean_years_at_company = mean(years_at_company, na.rm = TRUE),
            median_years_at_company = median(years_at_company, na.rm = TRUE),
            max_years_at_company = max(years_at_company, na.rm = TRUE),
            min_years_at_company = min(years_at_company, na.rm = TRUE),
            sd_years_at_company = sd(years_at_company, na.rm = TRUE),
            mean_monthly_income = mean(monthly_income, na.rm = TRUE),
            median_monthly_income = median(monthly_income, na.rm = TRUE),
            max_monthly_income = max(monthly_income, na.rm = TRUE),
            min_monthly_income = min(monthly_income, na.rm = TRUE),
            sd_monthly_income = sd(monthly_income, na.rm = TRUE),
            mean_years_since_last_promotion = mean(years_since_last_promotion, na.rm = TRUE),
            median_years_since_last_promotion = median(years_since_last_promotion, na.rm = TRUE),
            max_years_since_last_promotion = max(years_since_last_promotion, na.rm = TRUE),
            min_years_since_last_promotion = min(years_since_last_promotion, na.rm = TRUE),
            sd_years_since_last_promotion = sd(years_since_last_promotion, na.rm = TRUE))
distribution_keyvalues

## # A tibble: 1 x 20
## mean_age median_age max_age min_age sd_age mean_years_at_c… median_years_at…
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 36.9 36 60 18 9.14 7.01 5
## # … with 13 more variables: max_years_at_company <dbl>,
## # min_years_at_company <dbl>, sd_years_at_company <dbl>,
## # mean_monthly_income <dbl>, median_monthly_income <dbl>,
## # max_monthly_income <dbl>, min_monthly_income <dbl>,
## # sd_monthly_income <dbl>, mean_years_since_last_promotion <dbl>,
## # median_years_since_last_promotion <dbl>,
## # max_years_since_last_promotion <dbl>, min_years_since_last_promotion <dbl>,
## # sd_years_since_last_promotion <dbl>
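A rough way to guess which of these variables is closest to Normal from the summary statistics alone: when the mean sits well above the median, the distribution is right-skewed. Using the age and tenure figures reported in the table above:

```r
# Values reported in the summary table above
age_mean    <- 36.9
age_median  <- 36
tenure_mean <- 7.01
tenure_median <- 5

# Crude skewness signal: relative gap between mean and median
skew_signal <- function(m, med) (m - med) / med

skew_signal(age_mean, age_median)        # small gap: age is nearly symmetric
skew_signal(tenure_mean, tenure_median)  # large gap: tenure is strongly right-skewed
```

This is only a heuristic, but it agrees with the boxplots below: age looks closest to Normal, while years_at_company, monthly_income and years_since_last_promotion are all right-skewed.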
ggplot(hr_cleaned, aes(x = age)) +
  labs(title = "Boxplot - Age Distribution", x = "Age") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned, aes(x = years_at_company)) +
  labs(title = "Boxplot - Company tenure", x = "Years at company") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned, aes(x = monthly_income)) +
  labs(title = "Boxplot - Monthly income", x = "Income") +
  geom_boxplot() +
  theme_classic()

ggplot(hr_cleaned, aes(x = years_since_last_promotion)) +
  labs(title = "Boxplot - Years since last promotion", x = "Years since last promotion") +
  geom_boxplot() +
  theme_classic()

count(hr_cleaned, job_satisfaction == "Low", job_satisfaction == "Medium", job_satisfaction == "High", job_satisfaction == "Very High")

## # A tibble: 4 x 5
## `job_satisfaction… `job_satisfactio… `job_satisfactio… `job_satisfactio… n
## <lgl> <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE TRUE 459
## 2 FALSE FALSE TRUE FALSE 442
## 3 FALSE TRUE FALSE FALSE 280
## 4 TRUE FALSE FALSE FALSE 289
job_satis <- hr_cleaned %>%
group_by(job_satisfaction) %>%
count() %>%
ungroup() %>%
mutate(pct_job_satisfaction = 100*n/sum(n))
job_satis

## # A tibble: 4 x 3
## job_satisfaction n pct_job_satisfaction
## <chr> <int> <dbl>
## 1 High 442 30.1
## 2 Low 289 19.7
## 3 Medium 280 19.0
## 4 Very High 459 31.2
ggplot(data = job_satis, mapping = aes(x = job_satisfaction, y = pct_job_satisfaction)) +
  geom_col() +
  NULL

count(hr_cleaned, work_life_balance == "Bad", work_life_balance == "Good", work_life_balance == "Better", work_life_balance == "Best")

## # A tibble: 4 x 5
## `work_life_balanc… `work_life_balan… `work_life_balan… `work_life_balan… n
## <lgl> <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE TRUE 153
## 2 FALSE FALSE TRUE FALSE 893
## 3 FALSE TRUE FALSE FALSE 344
## 4 TRUE FALSE FALSE FALSE 80
hr_cleaned %>%
group_by(work_life_balance) %>%
count() %>%
ungroup() %>%
  mutate(pct_work_life_balance = 100*n/sum(n))

## # A tibble: 4 x 3
## work_life_balance n pct_work_life_balance
## <chr> <int> <dbl>
## 1 Bad 80 5.44
## 2 Best 153 10.4
## 3 Better 893 60.7
## 4 Good 344 23.4
ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) +
  geom_density() +
  facet_wrap(~ education) +
  NULL

ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) +
  geom_density() +
  facet_wrap(~ gender) +
  NULL

income_education <- hr_cleaned %>%
group_by(education) %>%
  summarise(mean_income = mean(monthly_income, na.rm = TRUE),
            median_income = median(monthly_income, na.rm = TRUE),
            max_income = max(monthly_income, na.rm = TRUE),
            min_income = min(monthly_income, na.rm = TRUE),
            sd_income = sd(monthly_income, na.rm = TRUE))
income_education

## # A tibble: 5 x 6
## education mean_income median_income max_income min_income sd_income
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bachelor 6517. 4762 19926 1081 4817.
## 2 Below College 5641. 3849 19973 1009 4485.
## 3 College 6227. 4892. 19613 1051 4525.
## 4 Doctor 8278. 6203 19586 2127 5061.
## 5 Master 6832. 5342. 19999 1359 4657.
income_gender <- hr_cleaned %>%
group_by(gender) %>%
  summarise(mean_income = mean(monthly_income, na.rm = TRUE),
            median_income = median(monthly_income, na.rm = TRUE),
            max_income = max(monthly_income, na.rm = TRUE),
            min_income = min(monthly_income, na.rm = TRUE),
            sd_income = sd(monthly_income, na.rm = TRUE))
income_gender

## # A tibble: 2 x 6
## gender mean_income median_income max_income min_income sd_income
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female 6687. 5082. 19973 1129 4696.
## 2 Male 6381. 4838. 19999 1009 4715.
ggplot(data = hr_cleaned,
       mapping = aes(x = reorder(factor(job_role), -monthly_income), y = monthly_income)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90)) +
  NULL

income_bar_education <- hr_cleaned %>%
group_by(education) %>%
summarise(mean_income=mean(monthly_income, na.rm=TRUE))
income_bar_education

## # A tibble: 5 x 2
## education mean_income
## <chr> <dbl>
## 1 Bachelor 6517.
## 2 Below College 5641.
## 3 College 6227.
## 4 Doctor 8278.
## 5 Master 6832.
ggplot(data = hr_cleaned, mapping = aes(x = monthly_income)) +
  geom_density() +
  facet_wrap(~ education) +
  theme_solarized() +
  NULL

ggplot(data = hr_cleaned, mapping = aes(x = age, y = monthly_income)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ job_role) +
  NULL

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

- How often do people leave the company (attrition)?
- How are age, years_at_company, monthly_income and years_since_last_promotion distributed? Can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
- How are job_satisfaction and work_life_balance distributed? Don't just report counts, but express categories as % of total.
- Plot the distribution of income by education level, using a facet_wrap and a theme from ggthemes.
- Plot income vs age, faceted by job_role.

The purpose of this exercise is to make a publication-ready plot using your dplyr and ggplot2 skills. Open the journal article "Riddell_Annals_Hom-Sui-Disparities.pdf". Read the abstract and have a look at Figure 3. The data you need is "CDC_Males.csv".
Don’t worry about replicating it exactly, try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out.
You may find these helpful:
## Rows: 104
## Columns: 28
## $ ST <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ State <chr> "Alaska", "Alaska", "Alabama", "Alabama", "Ar…
## $ Population.Black <dbl> 153664, 153664, 5376635, 5376635, 1991165, 19…
## $ Population.White <dbl> 2274605, 2274605, 14267619, 14267619, 9747025…
## $ Deaths.homicide.Black <dbl> 23, NA, 183, 1812, 667, 90, 300, 40, 3712, 43…
## $ Deaths.homicide.White <dbl> 78, 10, 124, 620, 416, 110, 586, 145, 1399, 6…
## $ crude.homicide.Black <dbl> 14.97, NA, 3.40, 33.70, 33.50, 4.52, 22.06, 2…
## $ crude.homicide.White <dbl> 3.43, NA, 0.87, 4.35, 4.27, 1.13, 3.46, 0.86,…
## $ adjusted.homicide.Black <dbl> 12.30, NA, 3.51, 33.00, 33.39, 4.81, 20.40, 2…
## $ adjusted.homicide.White <dbl> 3.24, NA, 0.85, 4.47, 4.39, 1.13, 3.63, 0.91,…
## $ Deaths.suicide.Black <dbl> 23, 12, 148, 370, 118, 75, 101, 80, 492, 639,…
## $ Deaths.suicide.White <dbl> 535, 196, 1222, 3195, 2154, 1027, 4146, 1972,…
## $ crude.suicide.Black <dbl> 14.97, NA, 2.75, 6.88, 5.93, 3.77, 7.43, 5.88…
## $ crude.suicide.White <dbl> 23.52, 8.62, 8.56, 22.39, 22.10, 10.54, 24.47…
## $ adjusted.suicide.Black <dbl> 12.77, NA, 2.84, 7.20, 6.25, 3.93, 7.51, 5.58…
## $ adjusted.suicide.White <dbl> 23.47, 8.21, 8.64, 20.97, 20.98, 10.89, 21.42…
## $ type <chr> "Firearm", "Non-Firearm", "Non-Firearm", "Fir…
## $ crude.RD.suicide <dbl> -8.55, NA, -5.81, -15.51, -16.17, -6.77, -17.…
## $ adj.RD.suicide <dbl> -10.70, NA, -5.80, -13.77, -14.73, -6.96, -13…
## $ crude.RD.homicide <dbl> 11.54, NA, 2.53, 29.35, 29.23, 3.39, 18.60, 2…
## $ adj.RD.homicide <dbl> 9.06, NA, 2.66, 28.53, 29.00, 3.68, 16.77, 1.…
## $ ST.order.RD.homicide <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ ST.order.RD.suicide <chr> "AK", "AK", "AL", "AL", "AR", "AR", "AZ", "AZ…
## $ gun.house.prev <dbl> 59.8, 59.8, 52.2, 52.2, 58.8, 58.8, 32.3, 32.…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …
## $ average.pop.white <dbl> 252734, 252734, 1585291, 1585291, 1083003, 10…
## $ average.pop.black <dbl> 17074, 17074, 597404, 597404, 221241, 221241,…
## $ type.fac <chr> "Firearm-related", "Firearm-unrelated", "Fire…
## Rows: 50
## Columns: 3
## $ ST <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE…
## $ average.pop.white <dbl> 252734, 1585291, 1083003, 1882345, 7703022, 1…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …
## Rows: 52
## Columns: 3
## $ ST <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC…
## $ adjusted.homicide.White <dbl> 3.24, 4.47, 4.39, 3.63, 2.05, 1.75, 0.86, NA,…
## $ adjusted.suicide.White <dbl> 23.47, 20.97, 20.98, 21.42, 11.63, 19.48, 6.4…
## Rows: 50
## Columns: 5
## $ ST <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE…
## $ adjusted.homicide.White <dbl> 3.24, 4.47, 4.39, 3.63, 2.05, 1.75, 0.86, 1.7…
## $ adjusted.suicide.White <dbl> 23.47, 20.97, 20.98, 21.42, 11.63, 19.48, 6.4…
## $ average.pop.white <dbl> 252734, 1585291, 1083003, 1882345, 7703022, 1…
## $ gun.house.prev.category <chr> "45.0%-65.5%", "45.0%-65.5%", "45.0%-65.5%", …
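The glimpses above suggest one possible starting point for replicating Figure 3: a scatterplot of white men’s age-adjusted suicide rate against their age-adjusted homicide rate by state, with point size mapped to population and fill to the household gun-ownership category. A minimal sketch, assuming the CSV has been read into an object called `cdc_males` (all column names follow the glimpse output above):

```r
library(tidyverse)

# Sketch only: adapt filters and scales to match the published figure.
cdc_males <- read_csv("CDC_Males.csv")

cdc_males %>%
  filter(type == "Firearm") %>%
  ggplot(aes(x = adjusted.suicide.White,
             y = adjusted.homicide.White,
             size = average.pop.white,
             fill = gun.house.prev.category)) +
  geom_point(shape = 21, colour = "black") +
  scale_size_area(max_size = 15, guide = "none") +
  labs(x = "White suicide rate (per 100,000 per year)",
       y = "White homicide rate (per 100,000 per year)",
       fill = "Gun ownership") +
  theme_minimal()
```

From there you can add state labels (e.g. with `ggrepel::geom_text_repel()`), a Spearman correlation annotation, and the colour palette used in the article.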
As discussed in class, I would like you to reproduce the plot that shows the top ten cities by amount raised in political contributions in California during the 2016 US Presidential election.
To get this plot, you must join two dataframes; the one you have with all contributions, and data that can translate zipcodes to cities. You can find a file with all US zipcodes, e.g., here http://www.uszipcodelist.com/download.html.
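As a rough outline of that join, assuming your contributions data frame is called `ca_contributions` (with a `zip` column) and the downloaded lookup file is `zip_code_database.csv` (both names are placeholders for your own files):

```r
library(tidyverse)

# Zip codes should be read as character, not numeric,
# or leading zeros will be dropped and the join will fail.
zipcodes <- read_csv("zip_code_database.csv",
                     col_types = cols(zip = col_character())) %>%
  select(zip, primary_city)

top10_cities <- ca_contributions %>%
  left_join(zipcodes, by = "zip") %>%
  group_by(primary_city) %>%
  summarise(total_raised = sum(contb_receipt_amt)) %>%
  slice_max(total_raised, n = 10)
```

The column name `contb_receipt_amt` is an assumption based on the FEC file layout; check `glimpse()` on your own data first.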
The easiest way would be to create two plots and then place them next to each other. For this, you will need the patchwork package. https://cran.r-project.org/web/packages/patchwork/index.html
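With patchwork, combining two finished ggplot objects is a one-liner. A sketch, assuming `p_clinton` and `p_trump` are the two per-candidate bar charts you have already built:

```r
library(patchwork)

# `+` places plots side by side; `/` would stack them vertically.
p_clinton + p_trump
```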
While this is ok, what if one asked you to create the same plot for the top 10 candidates and not just the top two? The most challenging part is how to reorder within categories, and for this you will find Julia Silge’s post “Reordering and facetting for ggplot2” useful.
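The within-facet reordering that post describes is implemented by `reorder_within()` and `scale_y_reordered()` in the tidytext package. A hedged sketch, assuming a summarised data frame `city_totals` with columns `cand_nm`, `primary_city` and `total_raised` (hypothetical names):

```r
library(tidyverse)
library(tidytext)  # reorder_within() and scale_y_reordered()

city_totals %>%
  group_by(cand_nm) %>%
  slice_max(total_raised, n = 10) %>%
  ungroup() %>%
  # reorder cities separately inside each candidate's facet
  mutate(primary_city = reorder_within(primary_city, total_raised, cand_nm)) %>%
  ggplot(aes(x = total_raised, y = primary_city)) +
  geom_col() +
  facet_wrap(~ cand_nm, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "Amount raised ($)", y = NULL)
```

`scales = "free_y"` is essential here: each facet must be allowed its own set of city labels, or the reordering has no effect.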
There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack, and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labelled clearly or you’ve written additional text to describe how you interpret the output.